AITopics | word alignment

Collaborating Authors

word alignment

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

Ebing, Benedikt, Goldschmied, Christian, Glavaš, Goran

arXiv.org Artificial IntelligenceNov-3-2025

In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.

artificial intelligence, computational linguistic, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.27337

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Bridging Language Gaps: Advances in Cross-Lingual Information Retrieval with Multilingual LLMs

Goworek, Roksana, Macmillan-Scott, Olivia, Özyiğit, Eda B.

arXiv.org Artificial IntelligenceOct-2-2025

Cross-lingual information retrieval (CLIR) addresses the challenge of retrieving relevant documents written in languages different from that of the original query. Research in this area has typically framed the task as monolingual retrieval augmented by translation, treating retrieval methods and cross-lingual capabilities in isolation. Both monolingual and cross-lingual retrieval usually follow a pipeline of query expansion, ranking, re-ranking and, increasingly, question answering. Recent advances, however, have shifted from translation-based methods toward embedding-based approaches and leverage multilingual large language models (LLMs), for which aligning representations across languages remains a central challenge. The emergence of cross-lingual embeddings and multilingual LLMs has introduced a new paradigm, offering improved retrieval performance and enabling answer generation. This survey provides a comprehensive overview of developments from early translation-based methods to state-of-the-art embedding-driven and generative techniques. It presents a structured account of core CLIR components, evaluation practices, and available resources. Persistent challenges such as data imbalance and linguistic variation are identified, while promising directions are suggested for advancing equitable and effective cross-lingual information retrieval. By situating CLIR within the broader landscape of information retrieval and multilingual language processing, this work not only reviews current capabilities but also outlines future directions for building retrieval systems that are robust, inclusive, and adaptable.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.00908

Country:

Europe (1.00)
Asia > Middle East (1.00)
North America > United States > California (0.67)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Education (1.00)
Government (0.67)
Media > News (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Autoencoder Approach to Learning Bilingual Word Representations

Neural Information Processing SystemsSep-30-2025, 08:49:02 GMT

Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.

autoencoder approach, learning bilingual word representation, name change, (7 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (0.51)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.60)

Add feedback

Whisper Has an Internal Word Aligner

Yeh, Sung-Lin, Meng, Yen, Tang, Hao

arXiv.org Artificial IntelligenceSep-15-2025

Abstract--There is an increasing interest in obtaining accurate word-level timestamps from strong automatic speech recognizers, in particular Whisper . Existing approaches either require additional training or are simply not competitive. The evaluation in prior work is also relatively loose, typically using a tolerance of more than 200 ms. In this work, we discover attention heads in Whisper that capture accurate word alignments and are distinctively different from those that do not. Moreover, we find that using characters produces finer and more accurate alignments than using wordpieces. Based on these findings, we propose an unsupervised approach to extracting word alignments by filtering attention heads while teacher forcing Whisper with characters. Our approach not only does not require training but also produces word alignments that are more accurate than prior work under a stricter tolerance between 20 ms and 100 ms.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.09987

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Javorský, Dávid, Bojar, Ondřej, Yvon, François

arXiv.org Artificial IntelligenceJun-6-2025

In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2506.04848

Country:

Europe (1.00)
Asia > Middle East > UAE (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.82)

Industry: Education > Curriculum > Subject-Specific Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Conditional Random Field Autoencoders for Unsupervised Structured Prediction

Waleed Ammar, Chris Dyer, Noah A. Smith

Neural Information Processing SystemsFeb-12-2025, 01:00:06 GMT

We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observed data using a feature-rich conditional random field (CRF). Then a reconstruction of the input is (re)generated, conditional on the latent structure, using a generative model which factorizes similarly to the CRF. The autoencoder formulation enables efficient exact inference without resorting to unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate connections to traditional autoencoders, posterior regularization, and multi-view learning. We then show competitive results with instantiations of the framework for two canonical tasks in natural language processing: part-of-speech induction and bitext word alignment, and show that training the proposed model can be substantially more efficient than a comparable feature-rich baseline.

artificial intelligence, autoencoder, machine learning, (13 more...)

Neural Information Processing Systems

Country: North America > United States (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.69)

Add feedback

An Autoencoder Approach to Learning Bilingual Word Representations

Neural Information Processing SystemsJan-17-2025, 14:17:38 GMT

alignment, autoencoder approach, learning bilingual word representation, (3 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.65)

Add feedback

BinaryAlign: Word Alignment as Binary Sequence Labeling

Latouche, Gaetan Lopez, Carbonneau, Marc-André, Swanson, Ben

arXiv.org Artificial IntelligenceJul-16-2024

Real world deployments of word alignment are almost certain to cover both high and low resource languages. However, the state-of-the-art for this task recommends a different model class depending on the availability of gold alignment training data for a particular language pair. We propose BinaryAlign, a novel word alignment technique based on binary sequence labeling that outperforms existing approaches in both scenarios, offering a unifying approach to the task. Additionally, we vary the specific choice of multilingual foundation model, perform stratified error analysis over alignment error type, and explore the performance of BinaryAlign on non-English language pairs. We make our source code publicly available.

alignment, computational linguistic, word alignment, (15 more...)

arXiv.org Artificial Intelligence

2407.12881

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.04)
North America > Canada > Ontario > Toronto (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A Word Order Synchronization Metric for Evaluating Simultaneous Interpretation and Translation

Makinae, Mana, Sudoh, Katsuhito, Yamada, Mararu, Nakamura, Satoshi

arXiv.org Artificial IntelligenceJul-9-2024

Simultaneous interpretation (SI), the translation of one language to another in real time, starts translation before the original speech has finished. Its evaluation needs to consider both latency and quality. This trade-off is challenging especially for distant word order language pairs such as English and Japanese. To handle this word order gap, interpreters maintain the word order of the source language as much as possible to keep up with original language to minimize its latency while maintaining its quality, whereas in translation reordering happens to keep fluency in the target language. This means outputs synchronized with the source language are desirable based on the real SI situation, and it's a key for further progress in computational SI and simultaneous machine translation (SiMT). In this work, we propose an automatic evaluation metric for SI and SiMT focusing on word order synchronization. Our evaluation metric is based on rank correlation coefficients, leveraging cross-lingual pre-trained language models. Our experimental results on NAIST-SIC-Aligned and JNPC showed our metrics' effectiveness to measure word order synchronization between source and target language.

computational linguistic, synchronization, translation, (13 more...)

arXiv.org Artificial Intelligence

2407.0665

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.05)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(6 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Cross-Lingual Word Alignment for ASEAN Languages with Contrastive Learning

Zhang, Jingshen, Qiu, Xinying, Shen, Teng, Wang, Wenyu, Zhang, Kailin, Feng, Wenhe

arXiv.org Artificial IntelligenceJul-6-2024

Cross-lingual word alignment plays a crucial role in various natural language processing tasks, particularly for low-resource languages. Recent study proposes a BiLSTM-based encoder-decoder model that outperforms pre-trained language models in low-resource settings. However, their model only considers the similarity of word embedding spaces and does not explicitly model the differences between word embeddings. To address this limitation, we propose incorporating contrastive learning into the BiLSTM-based encoder-decoder framework. Our approach introduces a multi-view negative sampling strategy to learn the differences between word pairs in the shared cross-lingual embedding space. We evaluate our model on five bilingual aligned datasets spanning four ASEAN languages: Lao, Vietnamese, Thai, and Indonesian. Experimental results demonstrate that integrating contrastive learning consistently improves word alignment accuracy across all datasets, confirming the effectiveness of the proposed method in low-resource scenarios. We will release our data set and code to support future research on ASEAN or more low-resource word alignment.

alignment, contrastive learning, word alignment, (12 more...)

arXiv.org Artificial Intelligence

2407.05054

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Africa > Middle East > Egypt > Giza Governorate > Giza (0.05)
Asia > China > Guangdong Province > Guangzhou (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Add feedback